1. Introduction and Background


A central goal of computational linguistics is to improve the ability of computers to translate
one natural language into another, a task called machine translation (MT). Originally, it was
assumed that MT would be as simple as compiling a multilingual lexicon; however, such methods
met with only limited success. Today, MT relies on the ever-increasing capacity of computers to
ingest and learn from large amounts of bilingual data produced by human translators, known as
ground truth data. This method, statistical machine translation (SMT), models patterns of
translation by assigning weighted probabilities to bilingual correspondences derived from
ground truth data.

1.1  The  Need  for  Bilingual  Data 

Building an SMT engine requires huge amounts of bilingual data, both to serve as ground truth
and to properly calibrate the governing heuristics; more data generally makes an SMT engine
more accurate. For some languages, bilingual data pairing English with another language is
plentiful and easy to come by. These languages include French, German, and Spanish, all of
which are Indo-European. For some non-Indo-European languages, such as Japanese and Mandarin
Chinese, the amount of bilingual data with English is increasing. However, little bilingual
data is available between English and the languages of remote, less developed regions of the
world, such as Afghanistan (Dari and Pashto). Yet it is to regions like these that the Army
goes and, hence, where it needs MT. To compensate for the lack of data for building SMT, the
Army has invested resources in the production of bilingual text data for Dari-English and
Pashto-English.

1.2  The  Production  of  Bilingual  Data  for  SMT 

Producing high-quality bilingual text that is usable for building SMT is a long and involved
process. The primary steps are shown in figure 1.

Finding Parallel Text. The process begins with locating a source of text data written in fairly
equivalent versions in both English and the target language (as opposed to, say, a Dari text
and an English précis); this is called parallel text.

Correction and Normalization. Once located, the data must be copied into a word processor so
that a language expert may correct any errors. The expert must follow a standard to ensure
uniform stylistic and character formats.

Segmentation. Next, the expert must divide the text into segments small enough to be useful to
an SMT engine. Again, a general standard must be adopted and applied to every sentence.
Segmentation can present a particular challenge in the alignment phase if the sentences of the
source language and target language texts do not correspond one-to-one.

Figure 1. The pipeline process.

Alignment Formatting. After that, the segments of one language must be aligned with those of
the other. The aligned text must then be converted into Translation Memory exchange (TMX)
format to be compiled into a cohesive corpus. Each TMX entry consists of a pair of aligned
segments and a probability.
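The conversion of aligned segments to TMX can be illustrated with a minimal Python sketch. It
is not the code used in this work. The element names (tmx, header, body, tu, tuv, seg, prop)
follow the TMX 1.4 format; everything else is an assumption made for illustration: the function
name build_tmx, the tool name in the header, the language code "prs" for Dari, and the sample
pair. TMX 1.4 has no dedicated probability element, so the score is stored here in a
user-defined prop element.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="prs"):
    """Build a TMX tree from (source, target, score) triples.

    The language codes and the header values are illustrative assumptions.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "pipeline-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target, score in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        # Assumption: the probability is kept in a user-defined property.
        prop = ET.SubElement(tu, "prop", type="x-alignment-score")
        prop.text = f"{score:.3f}"
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Illustrative pair only; not taken from the corpus described here.
root = build_tmx([("Hello.", "سلام.", 0.9)])
tmx_text = ET.tostring(root, encoding="unicode")
```

Writing tmx_text to a file with a .tmx extension gives one translation unit per aligned pair.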

This entire procedure is painstakingly carried out by a single language expert. Understandably,
this method does not maximize the expert’s output. To make the best use of the Army’s time and
resources, it was decided that automation should be introduced to aid the expert wherever
appropriate.


2.  Pipelines  and  Experiments 


We refer to the series of automated processes designed to increase the amount of bilingual
parallel *.tmx text produced and authenticated by the language expert as the Pipeline.

Originally, the Pipeline combined open-source programs with program code written by Mark
Arehart (MITRE), and it was tailored to extract data from a single source only: the online
newspaper Sada-e Azadi. An enhanced version of the Pipeline incorporated additional
open-source tools, together with program code written by John Morgan and Will Tanenbaum.

2.1  The  Original  Pipeline 

The web site Sada-e Azadi displays the online version of an International Security Assistance
Force (ISAF) publication. It covers topics such as current events, politics, economy, media,
entertainment, and health. It also offers a question-and-answer column called “Baba Jan.”
Each article is offered in three languages: Dari, English, and Pashto.

2.1.1  Initial  Harvest 

To acquire this data, we used the open-source program Wget (1), which downloads HTML-annotated
text from web sites.
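A harvest of this kind could be driven from Python as in the sketch below. The Wget options
shown (--recursive, --level, --no-parent, --accept, --wait, --directory-prefix) are standard
Wget options, but the particular combination, the crawl depth, the function names, and the
placeholder URL are assumptions; the options actually used for the Pipeline are not recorded
here.

```python
import subprocess


def build_wget_command(start_url, dest_dir, depth=3):
    """Return a Wget command line for a polite recursive HTML download."""
    return [
        "wget",
        "--recursive",                    # follow links from the start page
        f"--level={depth}",               # limit recursion depth (assumed value)
        "--no-parent",                    # never climb above the start directory
        "--accept=html,htm",              # keep only HTML pages
        "--wait=1",                       # pause one second between requests
        f"--directory-prefix={dest_dir}", # save everything under dest_dir
        start_url,
    ]


def harvest(start_url, dest_dir, depth=3):
    """Run Wget; raises CalledProcessError if Wget exits with an error."""
    subprocess.run(build_wget_command(start_url, dest_dir, depth), check=True)


# Placeholder URL for illustration; the command is built but not executed here.
command = build_wget_command("http://example.org/", "harvest")
```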

2.1.2  Working  Storage 

The Pipeline then called a routine that wrote the HTML-annotated text of each Sada-e Azadi
article retrieved to one of three directories, according to the natural language of the article.

2.1.3  Extraction:  Tag  Stripping 

From there, the Pipeline passed control to the open-source parser Beautiful Soup (2), which
filtered out the HTML annotation, as well as any non-text data, such as pictures, sound files,
videos, and links.

2.1.4 Return to Storage

The Pipeline then saved the text from each article, in block paragraph form, as a text data
file in the appropriate language directory.
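The extraction and storage steps (sections 2.1.3 and 2.1.4) can be sketched as follows. To keep
the example self-contained, it uses the html.parser module from the Python standard library as
a stand-in for Beautiful Soup; it is not the Pipeline's code. The tag lists, the function
names, the directory names, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumption: contents of these elements are never article text (links included).
SKIP_TAGS = {"script", "style", "a"}
# Assumption: these elements mark paragraph boundaries.
BLOCK_TAGS = {"p", "div", "li", "br", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text as paragraphs, discarding tags and skipped content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._buffer = []
        self.paragraphs = []

    def flush(self):
        # Join buffered text and collapse runs of whitespace into single spaces.
        paragraph = " ".join("".join(self._buffer).split())
        if paragraph:
            self.paragraphs.append(paragraph)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def strip_tags(html):
    """Return the article text, one paragraph per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.paragraphs)


def save_article(text, language, name, root="corpus"):
    """Write text to <root>/<language>/<name>.txt and return the path."""
    out_dir = Path(root) / language
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.txt"
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

A call such as save_article(strip_tags(page), "dari", "article-001") would then place the
cleaned text in the Dari directory; the directory and file names are hypothetical.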

2.1.5  Text  Clean-up 

Prior to segmentation, the Pipeline called a routine that adjusted selected properties of the
resulting text, especially punctuation. This routine standardized certain language-specific
characters, such as the Arabic period and colon, and replaced stylized smart quotation marks
with the generic version.
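A clean-up routine of this kind amounts to a character translation table, as in the sketch
below. The table is illustrative only: the exact code points handled by the Pipeline are not
listed in this report, and mapping toward ASCII punctuation is an assumption. The colon
handling is omitted because the replaced character is not identified.

```python
# Illustrative mapping; the Pipeline's actual table is not documented here.
PUNCTUATION_MAP = str.maketrans({
    "\u06D4": ".",   # ARABIC FULL STOP
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
})


def clean_text(text):
    """Replace language-specific and stylized punctuation with generic forms."""
    return text.translate(PUNCTUATION_MAP)


cleaned = clean_text("He said \u201Cyes\u201D\u06D4")
```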

2.1.6  Segmentation  and  Alignment 

Finally, the Pipeline called a sentence-level segmenter, which output the text in a format
acceptable as input to SMT engines. When applied to both sides of the parallel corpus data, the
segmentation step fed the alignment process, which matched segments by order of occurrence
before the paired segments were presented to the language expert for data quality control.

2.2  The  Enhanced  Pipeline 

The original Pipeline featured a fairly simple segmenting algorithm based on end-of-sentence
punctuation. Essentially, the segmenter created a segment break at periods or similar
end-of-sentence symbols, such as question marks. Although it handled notable exceptions to this
rule efficiently, particularly title abbreviations (Dr., Mr., Ms., etc.), it remained a fairly
inelegant solution to the problem. Sentences did not always contain exactly the same
information from language to language, nor were they of consistent length.
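The behavior of a punctuation-based segmenter and an order-of-occurrence aligner can be
sketched as below. This is a reconstruction from the description above, not the original code.
The abbreviation list, the set of end-of-sentence characters, the treatment of leftover
segments, and the sample sentences are assumptions.

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}  # assumed exception list
SENTENCE_END = ".?!\u061F"                     # includes ARABIC QUESTION MARK


def segment(text):
    """Break after a token ending in sentence punctuation, except abbreviations."""
    segments, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in SENTENCE_END and token not in ABBREVIATIONS:
            segments.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        segments.append(" ".join(current))
    return segments


def align_by_order(source_segments, target_segments):
    """Pair the i-th source segment with the i-th target segment."""
    return list(zip(source_segments, target_segments))


english = segment("Dr. Karimi opened the clinic. Is it free? Yes.")
# If the other side merges two sentences into one, every later pair is shifted.
pairs = align_by_order(["s1", "s2", "s3"], ["t1", "t2+t3"])
```

In the second call, s2 is paired with a target segment that also covers s3, and s3 is left
without a partner. This is the kind of mismatch the language expert then has to repair by hand.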

2.2.1  Punkt  Pickle  Segmentation 

To segment the text better, we used the Punkt tool from the Natural Language ToolKit (NLTK) (3).
When trained on a critical mass of data in a given language, Punkt generates a
language-specific segmenter, which is stored as a Python pickle file and is referred to here as
a Pickle. The designers of Punkt included 12 Pickles with the NLTK, including one for English.
A Dari Pickle was generated using a corpus to which the Army already had access. A Pashto
Pickle has not yet been created because of an insufficient volume of ground-truth data. The
segments created by the Pickles tended to be far more sensible and coherent than those created
by the original segmenter.

2.2.2  Bilingual  Sentence  Aligner  Alignment 

An open-source Perl script called the Bilingual Sentence Aligner (BSA) (4) was employed to
align the data in parallel segments. The BSA’s accuracy improves as the volume of data grows.
Thus, given the large number of bilingual data segments produced by the first six steps in the
Pipeline, we expected improved alignment accuracy and a faster data quality control process for
the language expert.

2.3 The Experimental Procedure

Admittedly, the processes that make up both the Original Pipeline (OP) and the Enhanced
Pipeline (EP) introduce new opportunities for error. Nonetheless, for the first experiment, it
was hypothesized that the time saved by automated Harvesting alone would more than compensate
for any additional errors. An experimental framework was designed to observe the time and
efficiency of work performed under three conditions: Traditional (T), OP, and EP. The first
experiment (T versus OP) tested whether automation itself decreased time and increased
efficiency; a second experiment (OP versus EP) tested whether the Segmentation and Alignment
improvements did so as well.

To compare the efficiency of work performed under these conditions, 10 articles were selected
from the Sada-e Azadi web site. The articles were of similar length, about 25 lines, plus or
minus one line. Four articles were about politics, four involved health, and two discussed
media and entertainment. Five articles were randomly chosen for each version of the Pipeline:
two from politics, two from health, and one about media and entertainment. Both versions of the
Pipeline featured the same Harvest and Extraction processes; they first diverged at the
Segmentation stage. As noted in previous sections, the OP used a basic segmenter and aligned
each language’s segments solely according to the order in which they appeared. The EP used
Punkt and the BSA for Segmentation and Alignment. To determine which Pipeline yielded greater
efficiency, the language expert was timed from start to finish as he performed his data quality
control work of aligning and correcting any mistakes in the bilingual parallel versions of the
10 articles, five of which were processed by the OP and five by the EP. The versions were
alternated to mitigate possible learning effects based on order of presentation.
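The assignment of articles to the two versions of the Pipeline can be sketched as a stratified
random split followed by an alternating presentation order. The article identifiers, the
function names, and the choice to start with the OP are assumptions; only the topic counts and
the alternation come from the procedure described above.

```python
import random

# Hypothetical identifiers; the counts per topic follow the description above.
ARTICLES = {
    "politics": ["pol-1", "pol-2", "pol-3", "pol-4"],
    "health": ["hea-1", "hea-2", "hea-3", "hea-4"],
    "media": ["med-1", "med-2"],
}


def assign_conditions(articles, seed=None):
    """Randomly split each topic in half between the OP and the EP."""
    rng = random.Random(seed)
    assignment = {"OP": [], "EP": []}
    for ids in articles.values():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        assignment["OP"] += shuffled[:half]
        assignment["EP"] += shuffled[half:]
    return assignment


def presentation_order(assignment):
    """Alternate conditions (OP, EP, OP, EP, ...) to limit learning effects."""
    order = []
    for op_article, ep_article in zip(assignment["OP"], assignment["EP"]):
        order.append(("OP", op_article))
        order.append(("EP", ep_article))
    return order


assignment = assign_conditions(ARTICLES, seed=0)
order = presentation_order(assignment)
```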


3.  Results  and  Discussion 


The results of both experiments confirmed the value of automation. As suspected, without any
automation, the expert harvested and aligned only a tiny proportion of what he aligned in the
same amount of time with automated assistance. The difference in the volume of data aligned was
so large that a statistical comparison was judged unnecessary. The differences between the two
versions of the Pipeline were also quite pronounced. We calculated the mean time required by
the language expert to correct the paragraphs processed by the OP and the EP. The mean times
were 24.036 min and 1.978 min, respectively. A t-test found a statistically significant
difference between groups, t = 7.257 with 4 degrees of freedom (P = .002).
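The reported P value can be checked from the reported t statistic alone. The sketch below uses
the closed-form cumulative distribution function of Student's t distribution for exactly 4
degrees of freedom, so no statistics library is needed. It uses only the figures reported
above, not the raw timings, and the function names are ours. The calculation is two-tailed,
which is an assumption.

```python
import math


def t_cdf_df4(t):
    """Cumulative distribution of Student's t for exactly 4 degrees of freedom."""
    s = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(s)) * (1.0 - (t * t / s) / 12.0)


def two_tailed_p_df4(t):
    """Two-tailed P value for a t statistic with 4 degrees of freedom."""
    return 2.0 * (1.0 - t_cdf_df4(abs(t)))


mean_op, mean_ep = 24.036, 1.978   # mean minutes per condition, as reported
speedup = mean_op / mean_ep        # ratio of mean correction times
p_value = two_tailed_p_df4(7.257)  # t statistic as reported

print(f"ratio of mean times: {speedup:.1f}, two-tailed P: {p_value:.4f}")
```

The result rounds to the reported P = .002, and the ratio of the two means shows that
correction under the EP took roughly one-twelfth of the time required under the OP. With five
articles per condition, 4 degrees of freedom would be consistent with a paired comparison,
although the form of the t-test is not stated.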

Although it does not constitute empirical evidence, the expert’s subjective opinion of the two
versions of the Pipeline was also solicited, and it agreed with our experimental results. He
found it markedly more difficult to reconcile the lines produced with the OP than those
produced with the EP, owing to spurious Segmentation and Alignment. It was also his opinion
that the OP performed drastically worse on articles with a relatively greater number of
sentences.


4.  Summary  and  Conclusions 


The results of the experiments indicate a clear advantage in using automation for harvesting
and processing bilingual parallel text data. Although full automation is not yet feasible, the
addition of automated tools has proven an invaluable aid for language experts. A marked
increase in efficiency, similar to that gained through use of the OP and the EP, can lead to a
comparable growth in SMT capability.


5. Software


1.  Scrivano,  Giuseppe;  Niksic,  Hrvoje.  The  GNU  Project  (Version  1.12)  [Software].  Available 
from  http://ftp.gnu.org/gnu/wget/,  2009. 

2. Richardson, Leonard. Crummy (Version 3.1.0.1) [Software]. Available from
http://www.crummy.com/software/BeautifulSoup/#Download, 2009.

3. Willy; Bird, Steven; Loper, Edward; Nothman, Joel. Natural Language Toolkit (Version 2.0)
[Software]. Available from http://www.nltk.org/download. Algorithm: Kiss, Tibor; Strunk, Jan.
Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525,
2006.

4. Moore, Robert. Microsoft (Version 1.0) [Software]. Available from
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/, 2003.


List of Symbols, Abbreviations, and Acronyms


ARL     U.S. Army Research Laboratory
BSA     Bilingual Sentence Aligner
EP      Enhanced Pipeline
ISAF    International Security Assistance Force
MLCB    Multilingual Computing Branch
MT      machine translation
NLTK    Natural Language ToolKit
OP      Original Pipeline
SMT     statistical machine translation
T       Traditional
TMX     Translation Memory exchange


NO. OF
COPIES  ORGANIZATION

1  ADMNSTR 

ELEC  DEFNS  TECHL  INFO  CTR 
ATTN DTIC OCP

8725  JOHN  J  KINGMAN  RD  STE  0944 
FT BELVOIR VA 22060-6218

1  CD  OFC  OF  THE  SECY  OF  DEFNS 
ATTN  ODDRE  (R&AT) 

THE  PENTAGON 
WASHINGTON  DC  20301-3080 

1  US  ARMY  RSRCH  DEV  AND  ENGRG  CMND 

ARMAMENT  RSRCH  DEV  &  ENGRG  CTR 
ARMAMENT  ENGRG  &  TECHNLGY  CTR 
ATTN  AMSRD  AAR  AEF  T  J  MATTS 
BLDG  305 

ABERDEEN  PROVING  GROUND  MD  21005-5001 

1  US  ARMY  INFO  SYS  ENGRG  CMND 

ATTN  AMSEL  IE  TD  A  RIVERA 
FT  HUACHUCA  AZ  85613-5300 

1  COMMANDER 

US  ARMY  RDECOM 

ATTN  AMSRD  AMR  W  C  MCCORKLE 

5400  FOWLER  RD 

REDSTONE  ARSENAL  AL  35898-5000 

1  US  GOVERNMENT  PRINT  OFF 

DEPOSITORY  RECEIVING  SECTION 
ATTN  MAIL  STOP  ID  AD  J  TATE 
732  NORTH  CAPITOL  ST  NW 
WASHINGTON  DC  20402 

6  US  ARMY  RSRCH  LAB 

ATTN  IMNE  ALC  HRR  MAIL  &  RECORDS  MGMT 

ATTN RDRL CII T W TANENBAUM

ATTN  RDRL  CII  T  J  J  MORGAN 

ATTN  RDRL  CII  T  S  LAROCCA 

ATTN  RDRL  CIO  LL  TECHL  LIB 

ATTN  RDRL  CIO  MT  TECHL  PUB 

ADELPHI  MD  20783-1197 